(54) Multiple processor cache control system. 

(57) A tightly coupled, cache coherent computer 
system (100) utilizes a crossbar switch (182) to 
increase memory system and memory bus 
bandwidth, thus eliminating bottlenecks. Local 
directories (118,136,154,172), one for each node 
(174,176,178,180), keep track of valid data and 
eliminate bottlenecks associated with cache 
coherency. The system (100) is also scalable, 
easily allowing expansion by adding further 
processors, for example. 

This invention relates to tightly coupled computer 
systems. 

Multiple processor computer systems have several 
processors to increase their throughput. Broadly 
speaking, tightly coupled computer systems are those 
in which several processors, usually on separate 
printed circuit boards, operate nearly independently 
of each other while sharing a common memory system. 
This is in contrast to loosely coupled computer 
systems, also employing multiple processors, in which 
the processors do not share a common memory system. 

To improve system performance, it is known to 
use caches (secondary, high speed memories) in 
conjunction with the slower shared main memory. 
Each processor may have associated therewith a 
cache memory which temporarily stores copies of the 
data that is being accessed by that processor. 

US Patent No. 4,622,631 discloses a tightly coupled 
multiple processor computer system which provides 
for data coherency, a computer system being 
termed coherent if the data that are being accessed 
by a processor are always the last data written to the 
address of that data. The processors and shared main 
memory are coupled to a bus, and each processor 
contains a cache memory. According to this known 
system, only one of the plurality of processors and 
addressable main memory is a current owner of an 
address of a block of data, the current owner having 
the correct data for the owned address, and the 
ownership of an address being dynamically changeable 
among the addressable main memory and the 
plurality of processors. 

US Patent No. 4,755,930 discloses a caching 
system for a shared bus multiprocessor which 
includes several processors each with its own private 
cache memory. A cache coherency scheme is 
utilized, wherein cache coherence requires that if a 
processor modifies a copy of the memory location in 
its cache, that modification must migrate to the main 
memory or to all the other caches. Alternatively, all the 
other caches must be invalidated. 

It is found that known tightly coupled computer 
systems, having a cache coherency provision, tend to 
give rise to system bottlenecks, that is, locations in the 
system wherein, because of limited information 
handling capabilities, overall system information 
transfer rates are reduced. 

One type of bottleneck in prior art systems is 
caused by limited bandwidth in non-parallel memory 
systems. Another type of bottleneck in prior art 
systems occurs in the memory bus (between processor 
and memory), and becomes exacerbated as more 
processors are added to the system. Finally, the 
implementation of cache coherency schemes often 
creates bottlenecks of its own. 

It is an object of the present invention to provide 
a tightly coupled multiple processor computer system 
having cache coherency, wherein bottleneck problems 
are minimized, enabling high information transfer 
to be achieved. 

Therefore, according to the present invention, 
there is provided a tightly coupled system having 
cache coherency, characterized by: a plurality of nodes 
having respective processing means and respective 
memory means connected thereto; a plurality of 
caches associated with said processing means; and 
crossbar switching means adapted to directly connect 
any one of said nodes to any other of said nodes. 

One embodiment of the invention will now be 
described by way of example, with reference to the 
accompanying drawing, in which the sole figure 
shows a schematic block diagram of a tightly coupled 
computer system according to the invention. 

The sole drawing figure shows the preferred 
embodiment of a multiple-board, tightly coupled 
computer system 100, which uses a crossbar switch or 
crossbar 182 to interconnect four nodes 174, 176, 178 
and 180, although the system 100 is by no means 
limited to four nodes. Each node 174, 176, 178 and 180 
has coupled thereto: several processor (P) boards 
102-108, 120-126, 138-144, and 156-162, 
respectively; a connect-disconnect memory bus 110, 128, 
146 and 164, respectively; an input/output (I/O) board 
112, 130, 148 and 166, respectively; a memory 
system comprising two memory (M) boards 114-116, 
132-134, 150-152 and 168-170, respectively; and a 
lookaside cache coherency directory (D) 118, 136, 
154 and 172, respectively. The processors P are 
associated with respective caches (not shown), which 
may be contained in the processors. 

The crossbar 182, by itself, is well known in the 
art. As is conventional, the crossbar 182 allows direct 
communication between any two nodes, in response 
to control logic received from the nodes. For example, 
node 174 can be directly and exclusively connected 
to node 178 (or to any of the other nodes, for that 
matter) without having to use a common bus shared by 
all of the nodes. The same holds true for any other 
desired connections between the four nodes 174, 176, 
178 and 180. 
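
The connection behaviour described above can be illustrated with a minimal Python sketch. It is a toy model only, not the disclosed ASIC logic: the class name Crossbar, its methods, and the rule that a node takes part in at most one link at a time (one reading of "directly and exclusively connected") are assumptions made for illustration.

```python
class Crossbar:
    """Toy model: each node takes part in at most one direct link at a time."""

    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.peer = {}  # node -> node it is currently linked to

    def connect(self, a, b):
        """Link a and b directly; return False if either is already linked."""
        if a == b or a not in self.nodes or b not in self.nodes:
            raise ValueError("need two distinct, known nodes")
        if a in self.peer or b in self.peer:
            return False
        self.peer[a] = b
        self.peer[b] = a
        return True

    def disconnect(self, a):
        """Tear down the link that node a takes part in, if any."""
        b = self.peer.pop(a, None)
        if b is not None:
            self.peer.pop(b, None)


xbar = Crossbar([174, 176, 178, 180])
first = xbar.connect(174, 178)   # one direct path
second = xbar.connect(176, 180)  # a second, independent path at the same time
third = xbar.connect(174, 176)   # refused: both nodes are already linked
```

In this sketch the two links 174-178 and 176-180 coexist, which is the property a single shared bus lacks.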

The crossbar 182 is implemented with one or 
more application specific integrated circuit (ASIC) 
chips using complementary metal-oxide-semiconductor 
(CMOS) technology. 

The remainder of the description concerns the 
implementation of the crossbar 182 in the tightly 
coupled, cache coherent system 100. With this approach, 
the memory bus (110, 128, 146 and 164) bandwidth 
is increased, since communication between nodes 
involves different (and separate) paths rather than a 
single path. 

The operation of the lookaside cache coherency 
directories 118, 136, 154 and 172 will now be 
described. Each cache coherency directory D services 
only the memory on its corresponding node, and 
determines from a stored processor ID (an 8-bit 
pointer) whether a memory location requested 
contains valid or invalid data. Only one agent (processor 
P or memory M) in the system 100 can be the owner 
of the valid copy of a cache line, and there is no 
sharing of data between caches. If the processor ID is 
zero, for example, the memory on that node is the only 
owner of the valid copy of the cache line. If the 
processor ID is not zero, the memory on that node does 
not contain a valid copy of the cache line, and the 
processor ID indicates which processor P is the owner of 
the only valid copy of the cache line. 
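
By way of illustration only, the interpretation of the stored processor ID may be sketched in Python as follows. The dictionary layout, the name owner_of, the example contents, and the treatment of an unrecorded cache line as memory-owned are assumptions; the description fixes only the 8-bit width and the meaning of zero (given "for example") and non-zero values.

```python
MEMORY_OWNS = 0  # processor ID 0: the node's memory holds the valid copy


def owner_of(directory, line_address):
    """Return ("memory", None) or ("processor", pid) for a cache line."""
    # Assumption: a line with no recorded ID is treated as memory-owned.
    pid = directory.get(line_address, MEMORY_OWNS)
    if not 0 <= pid <= 0xFF:
        raise ValueError("processor ID must fit in 8 bits")
    if pid == MEMORY_OWNS:
        return ("memory", None)
    return ("processor", pid)


# Hypothetical contents of one directory: line address -> processor ID.
directory_118 = {0x40: 0, 0x80: 3}
```

With these contents, owner_of(directory_118, 0x40) reports the memory as owner, while owner_of(directory_118, 0x80) reports processor 3.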

Once the memory on a node serves the data to a 
requesting processor P, the corresponding directory 
D on that node records the processor ID of the 
requesting processor P against the cache line address. 
The requesting processor P, having received the data, 
becomes the exclusive owner of the cache line from 
which the data came. Should another processor P 
request the previously mentioned cache 
line, the directory D connected to the node from which 
the cache line originated requests the return of the 
cache line from the owner. The owner then returns the 
cache line to the directory D and changes its state 
from valid to invalid. At this time, the directory D 
updates its stored pointer ID to indicate the new 
exclusive owner (requesting processor) of the valid data, 
while sending the valid data to the new owner, and it 
also writes a copy of the valid data into the memory 
on its node. It should be noted that when a device 
requests a memory address via a memory system bus 
such as bus 110, the device releases the bus once a 
request has been made, to allow other devices to 
make requests over the bus before requested data is 
received by the requesting device. 
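
The ownership transfer described above may be sketched, purely as an illustration, in Python. The classes Cache and Directory, their method names, and the handling of a request from the processor that already owns the line are assumptions and not part of the disclosure; the split-transaction behaviour of the memory bus is not modelled, and the recorded owner is assumed still to hold the line.

```python
class Cache:
    def __init__(self):
        self.lines = {}  # address -> data for lines this processor owns

    def surrender(self, address):
        """Return the line to the directory and mark it invalid locally."""
        return self.lines.pop(address)


class Directory:
    def __init__(self, memory, caches):
        self.memory = memory  # address -> data held by this node's memory
        self.caches = caches  # processor ID (non-zero) -> Cache
        self.owner = {}       # address -> processor ID; absent or 0 means memory

    def request(self, pid, address):
        """Make processor pid the exclusive owner of the line and return its data."""
        current = self.owner.get(address, 0)
        if current == pid:
            # Assumed case: the requester already owns the line.
            return self.caches[pid].lines[address]
        if current != 0:
            # Recall the line from its owner and write a copy into memory.
            self.memory[address] = self.caches[current].surrender(address)
        data = self.memory[address]
        self.owner[address] = pid             # record the new exclusive owner
        self.caches[pid].lines[address] = data
        return data


caches = {1: Cache(), 2: Cache()}
directory = Directory({0x40: "old"}, caches)
directory.request(1, 0x40)            # processor 1 becomes the owner
caches[1].lines[0x40] = "new"         # processor 1 modifies its copy
value = directory.request(2, 0x40)    # processor 2 asks for the same line
```

After the second request, processor 2 holds the modified data, the copy in processor 1's cache is gone, and the node's memory has been updated.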

A cache may replace an unmodified or a modified 
cache line. Where an unmodified cache line is 
replaced, since such replacement is not notified to 
memory, the relevant directory D is unaware of the 
replacement. However, if the replaced cache line is 
subsequently requested, the processor P which has 
replaced the cache line in its cache returns a "null" 
message to the directory, which indicates to the 
directory that the true owner of the requested cache line is 
the memory from which the cache line originated, and 
the directory responds by adjusting its pointer 
accordingly. 

When a cache replaces modified data, the 
modified data is returned to the directory D connected to 
its home memory, and that directory D changes its 
processor ID to zero, while writing the data into 
memory. 
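
The two replacement cases may be illustrated by the following Python sketch. It is a simplified model under stated assumptions: the NULL sentinel stands in for the "null" message, the names Cache, Directory, recall and write_back are hypothetical, and recall leaves the line memory-owned, after which the directory could serve a new requester in the usual way.

```python
NULL = object()  # stands in for the "null" message


class Cache:
    def __init__(self):
        self.lines = {}  # address -> data currently held

    def surrender(self, address):
        """Return the line if still cached, otherwise the null message."""
        return self.lines.pop(address, NULL)


class Directory:
    def __init__(self, memory):
        self.memory = memory  # address -> data in the home memory
        self.owner = {}       # address -> processor ID, 0 meaning memory

    def recall(self, address, cache):
        """Ask the recorded owner for the line back."""
        reply = cache.surrender(address)
        if reply is not NULL:
            self.memory[address] = reply  # owner still had the line
        # On a null reply the memory copy is already the valid one.
        self.owner[address] = 0
        return self.memory[address]

    def write_back(self, address, data):
        """A cache replaces a modified line: store it and reset the pointer."""
        self.memory[address] = data
        self.owner[address] = 0


directory = Directory({0x40: "clean", 0x80: "stale"})
cache = Cache()
directory.owner[0x40] = 1  # directory still records processor 1 as owner
after_silent_drop = directory.recall(0x40, cache)  # cache had dropped the line
directory.owner[0x80] = 1
directory.write_back(0x80, "dirty")  # modified line returned on replacement
```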

Each directory D snoops for read and write 
requests (only to its home memory) from the I/O 
boards 112, 130, 148 and 166, in the same manner as 
for each processor P. The same procedure for 
handling a request from a processor P is followed when 
handling a request from an I/O board. 



For lock memory operations (where a block of 
memory is locked for use by a single processor), the 
semaphores (control flags) needed for the operations 
are stored in the home directories rather than in the 
processor caches. This provides fast access, but 
does not create a high level of traffic, since the 
semaphores are a low percentage of all operations 
occurring throughout the computer system 100. 

Read operations initiated by processors P from 
different nodes cause other directories D to have 
copies of the semaphore. To initiate a lock operation, 
a requesting processor P issues a read-with-intent-to- 
modify (RIM) message that passes through its cache 
to the home directory of the cache line, which RIM 
causes the home directory to broadcast an invalid 
message to all directories D. 

Once a lock operation has begun, the home 
directory does not allow access to the semaphore until the 
write operation associated with the lock operation is 
complete. The home directory issues retry messages 
to any request to access the semaphore occurring 
between the read and completion of the write of the lock 
operation. 
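
The retry behaviour of the home directory during a lock operation may be sketched in Python as follows. This is an illustrative model only: the class HomeDirectory, the RETRY value, the method names, and the initial semaphore value of zero are assumptions. The broadcast of the invalid message to the other directories is not modelled, and only the write that completes a lock operation is represented.

```python
RETRY = "retry"


class HomeDirectory:
    def __init__(self):
        self.semaphores = {}  # address -> semaphore value
        self.locked_by = {}   # address -> processor with a lock operation in flight

    def rim(self, pid, address):
        """Read-with-intent-to-modify: begin a lock operation."""
        if address in self.locked_by:
            return RETRY
        self.locked_by[address] = pid
        return self.semaphores.get(address, 0)

    def read(self, pid, address):
        """Plain read of the semaphore; refused while a lock operation is open."""
        if address in self.locked_by:
            return RETRY
        return self.semaphores.get(address, 0)

    def write(self, pid, address, value):
        """Complete the lock operation begun by rim() from the same processor."""
        if self.locked_by.get(address) != pid:
            return RETRY
        self.semaphores[address] = value
        del self.locked_by[address]
        return value


home = HomeDirectory()
seen = home.rim(1, 0x10)       # processor 1 reads the semaphore
blocked = home.read(2, 0x10)   # processor 2 is told to retry
home.write(1, 0x10, 1)         # processor 1 completes the write
now = home.read(2, 0x10)       # processor 2 now sees the updated semaphore
```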

The advantages of the preferred embodiment are 
several. The use of the crossbar 182 increases the 
memory bus bandwidth available by a factor of 
approximately n, where n represents the number of nodes. 
(Likewise, the frequency of use of each memory bus 
is reduced by a factor of approximately n.) The use 
of a distributed memory system eliminates memory 
system bottlenecks associated with many prior art 
tightly coupled computer systems. By using 
distributed directories (as opposed to carrying out cache 
coherency in the processor caches) which handle 
cache coherency, each directory only snoops for its 
node rather than for the entire system, which 
eliminates cache coherency bottlenecks. Also, because of 
the distributed cache coherency directories, the 
system is readily scalable, easily allowing the addition of 
more processors P and I/O interfaces. By not sharing 
data between caches, the need for broadcasting and 
validating to all caches is eliminated, which prevents 
overloading of the crossbar and nodes. 

Variations and modifications to the present 
invention are possible given the above disclosure. For 
example, the present invention is not limited in scope 
to the same number of nodes, processors, or memory 
systems illustrated in the description of the preferred 
embodiment. 

Also, the system 100 could be modified so that 
the directories D are replaced by a memory resident 
cache coherency algorithm in each memory system, 
although the memory system in this case must be 
designed for the largest anticipated configuration. 
This alternate design still provides high memory bus 
and memory system bandwidth, but is not as easily 
scalable as the preferred embodiment. 

Claims 

1. A tightly coupled system having cache coherency, 
characterized by: a plurality of nodes 
(174,176,178,180) having respective processing 
means (102-108;120-126;138-144;156-162) and 
respective memory means (114,116;132,134; 
150,152;168,170) connected thereto; a plurality 
of caches associated with said processing means 
(102-108;120-126;138-144;156-162); and crossbar 
switching means (182) adapted to directly connect 
any one of said nodes (174,176,178,180) to 
any other of said nodes (174,176,178,180). 

2. A tightly coupled computer system according to 
claim 1, characterized in that said nodes 
(174,176,178,180) have respective lookaside 
directories (118,136,154,172) coupled thereto, 
each said lookaside directory (e.g. 118) being 
adapted to control cache coherency for the memory 
means (e.g. 114,116) connected to the 
associated node (e.g. 174). 

3. A tightly coupled computer system according to 
claim 1 or claim 2, characterized in that each said 
lookaside directory (e.g. 118) is adapted to maintain 
an identification of the location of valid 
cached data originating from the associated 
memory means (e.g. 114,116). 

4. A tightly coupled computer system according to 
any one of the preceding claims, characterized in 
that each node (e.g. 174) is connected to an 
associated memory system bus (e.g. 110) to 
which associated data requesting devices, 
including the associated processing means (e.g. 
102-108) are connected, and in that each data 
requesting device requesting a memory address 
via an associated memory system bus (e.g. 110) 
releases that memory system bus (e.g. 110) once 
a request has been made by a first requesting 
device, to allow other data requesting devices to 
make requests over the memory system bus 
before requested data is received by said first 
requesting device. 

5. A tightly coupled computer system according to 
any one of claims 2 to 4, characterized in that each said 
lookaside directory (e.g. 118) is adapted to store 
semaphore signals utilized in memory lock 
operations. 

6. A tightly coupled computer system according to 
any one of the preceding claims, characterized in 
that said crossbar switching means (182) is 
implemented by an integrated circuit chip. 